feat: Enhance process_systems to recursively search all paths in systems list #5033

OutisLi · 2025-11-02T04:40:06Z

Description

This PR modifies the process_systems utility function to change how it handles list inputs.

Previously, if the systems argument was a str, the function would recursively search that path for systems. However, if systems was a list, the function would return the list as-is, assuming it was already a complete list of system paths.

This update unifies the logic. The function now treats every string path—whether it's a single str input or an item within a list—as a directory to be recursively searched. It also refactors the internal logic to first normalize the input into a list of paths and then process them uniformly, improving code clarity and maintainability.

Motivation and Justification

The original implementation's inconsistent handling of str versus list inputs caused two significant problems:

Broken JSON Configurations: A very common use case, specifying a single data directory in input.json like "systems": ["/path/to/training_data"], would fail. The function would not search inside /path/to/training_data for the actual system directories (e.g., set.000, set.001, etc.).
Inability to Aggregate Data: It was impossible for users to combine multiple datasets by providing a list of top-level directories, such as "systems": ["/path/to/dataset_A", "/path/to/dataset_B"].

This change solves both problems by ensuring that paths provided in a list are searched recursively, just as a single string path would be.

Benefits

Fixes Bug: Correctly processes the common configuration of a single-item list in input.json.
Enables Data Aggregation: Users can now successfully provide a list of multiple data directories to be searched and combined.
Improves Consistency: The function's behavior is now intuitive and consistent, regardless of whether the user provides a single str or a list[str].

Summary by CodeRabbit

Documentation
- Clarified how training/validation system paths may be specified: a single system directory or a parent directory to recursively discover systems; lists of paths are explicitly supported and processed per-item.
Improvements
- Input handling for system paths enhanced to accept multiple paths and consolidate discovered systems, with expanded pattern-based discovery for more flexible data input.

Copilot

Pull Request Overview

This PR refactors the process_systems function to handle lists of system directory paths by iterating over each path and applying expansion logic individually. Previously, when systems was a list, it was simply copied; now each item is processed through the same expansion logic as single string inputs.

Key changes:

Modified process_systems to iterate over list items and expand each path individually
Updated documentation for better clarity on how the function handles both strings and lists

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
deepmd/utils/data_system.py	Refactored `process_systems` to expand each list item individually, improved docstring clarity, and added explicit type checking with error handling
deepmd/utils/argcheck.py	Updated documentation strings to clarify behavior for string and list inputs for both training and validation data

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

deepmd/utils/data_system.py

coderabbitai · 2025-11-02T04:43:27Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Updated documentation for training and validation data argument descriptions to clarify system path handling. Modified process_systems to accept either a single path string or a list of path strings, normalize to a list, expand each entry (via expand_sys_str or rglob_sys_str), and accumulate results into a consolidated list; invalid input types raise ValueError.

Changes

Cohort / File(s)	Summary
Documentation updates for data arguments `deepmd/utils/argcheck.py`	Revised docstrings for `training_data_args` and `validation_data_args` to state that a `systems` value may be a system directory path (containing `type.raw`) or a parent directory to recursively search for system subdirectories; when a list is provided, each string item is processed the same way.
Path processing enhancement `deepmd/utils/data_system.py`	`process_systems` now accepts a single string or a list of strings, normalizes input to a list, iterates over each path applying `expand_sys_str` (when patterns is None) or `rglob_sys_str` (when patterns provided), extends a consolidated `result_systems` list with each expansion, and raises `ValueError` for unsupported input types.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant process_systems
    participant expand_sys_str
    participant rglob_sys_str
    rect `#e6f7ff`
    Caller->>process_systems: call with systems (str or list)
    end
    alt input is str
        process_systems->>process_systems: normalize -> [str]
    end
    loop for each path in list
        process_systems->>process_systems: decide pattern present?
        alt patterns is None
            process_systems->>expand_sys_str: expand path
            expand_sys_str-->>process_systems: expanded_paths
        else patterns present
            process_systems->>rglob_sys_str: rglob path with patterns
            rglob_sys_str-->>process_systems: matched_paths
        end
        process_systems->>process_systems: extend result_systems
    end
    process_systems-->>Caller: return consolidated result_systems (list)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Extra attention areas:
- Edge cases: empty lists, duplicate paths, and mixed path types
- Backward compatibility with callers expecting previous return shapes
- Clarity and content of the ValueError raised for unsupported input types

Possibly related PRs

Perf: remove redundant checks on data integrity #4433: Overlaps with modifications to process_systems in deepmd/utils/data_system.py (related behavior/refactor).
docs: move arg docs from Variant to sub Arguments #4369: Related updates to argument documentation in deepmd/utils/argcheck.py.

Suggested reviewers

njzjz

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: Enhance process_systems to recursively search all paths in systems list' directly and specifically summarizes the main change: updating process_systems to recursively search each path in a systems list.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

deepmd/utils/data_system.py (2)
808-817: Consider TypeError and add list item validation.

Two suggestions for improvement:

Use TypeError for invalid types: When the input type is incorrect, TypeError is more semantically appropriate than ValueError (as suggested by static analysis).

Validate list items are strings: Currently, if systems is a list containing non-string items (e.g., ["/path", None] or ["/path", 123]), the error will occur later in expand_sys_str or rglob_sys_str, producing a potentially confusing error message.

Apply this diff to improve type handling:
 # Normalize input to a list of paths to search
 if isinstance(systems, str):
     search_paths = [systems]
 elif isinstance(systems, list):
+    # Validate all items are strings
+    for idx, item in enumerate(systems):
+        if not isinstance(item, str):
+            raise TypeError(
+                f"All items in systems list must be str, but systems[{idx}] is {type(item).__name__}."
+            )
     search_paths = systems
 else:
     # Handle unsupported input types
-    raise ValueError(
-        f"Invalid systems type: {type(systems)}. Must be str or list[str]."
+    raise TypeError(
+        f"systems must be str or list[str], got {type(systems).__name__}."
     )
819-829: Consider deduplicating the final result for edge cases.

The current implementation may return duplicate system paths if:

Multiple paths in the input list overlap (e.g., ["/data", "/data/subset1"])

The same path appears multiple times in the input list

While rglob_sys_str deduplicates within each path search, expand_sys_str does not, and duplicates across different search paths are not removed.

If deduplication is desired, you could modify the return statement:
     result_systems.extend(expanded_paths)

-return result_systems
+# Deduplicate while preserving order
+seen = set()
+deduplicated = []
+for system in result_systems:
+    if system not in seen:
+        seen.add(system)
+        deduplicated.append(system)
+return deduplicated
Alternatively, if order doesn't matter:
-return result_systems
+return list(dict.fromkeys(result_systems))  # Preserves order in Python 3.7+
Or simply:
-return result_systems
+return list(set(result_systems))  # May change order

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b970b81 and b0d858f.

📒 Files selected for processing (2)

deepmd/utils/argcheck.py (2 hunks)
deepmd/utils/data_system.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

deepmd/utils/data_system.py
deepmd/utils/argcheck.py

🧬 Code graph analysis (1)

deepmd/utils/data_system.py (1)

deepmd/common.py (2)

expand_sys_str (191-208)

rglob_sys_str (211-232)

🪛 Ruff (0.14.2)

deepmd/utils/data_system.py

815-817: Prefer TypeError exception for invalid type

(TRY004)

815-817: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (47)

GitHub Check: CodeQL analysis (python)
GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
GitHub Check: Test Python (6, 3.9)
GitHub Check: Test Python (6, 3.12)
GitHub Check: Test Python (5, 3.9)
GitHub Check: Test Python (5, 3.12)
GitHub Check: Build wheels for cp311-macosx_x86_64
GitHub Check: Test Python (4, 3.12)
GitHub Check: Build wheels for cp311-macosx_arm64
GitHub Check: Test C++ (true)
GitHub Check: Test Python (4, 3.9)
GitHub Check: Build C++ (rocm, rocm)
GitHub Check: Test C++ (false)
GitHub Check: Test Python (2, 3.9)
GitHub Check: Build wheels for cp311-manylinux_x86_64
GitHub Check: Test Python (2, 3.12)
GitHub Check: Test Python (1, 3.9)
GitHub Check: Test Python (3, 3.9)
GitHub Check: Build C++ (cuda120, cuda)
GitHub Check: Test Python (3, 3.12)
GitHub Check: Build C++ (cpu, cpu)
GitHub Check: Test Python (1, 3.12)
GitHub Check: Build C++ (cuda, cuda)
GitHub Check: Build C++ (clang, clang)
GitHub Check: Build wheels for cp311-win_amd64
GitHub Check: Build wheels for cp310-manylinux_aarch64
GitHub Check: Build wheels for cp311-manylinux_x86_64
GitHub Check: Analyze (python)
GitHub Check: Analyze (c-cpp)
GitHub Check: Analyze (c-cpp)
GitHub Check: Test Python (6, 3.12)
GitHub Check: Test Python (6, 3.9)
GitHub Check: Test Python (5, 3.12)
GitHub Check: Test Python (3, 3.9)
GitHub Check: Test Python (2, 3.12)
GitHub Check: Test Python (4, 3.12)
GitHub Check: Test Python (4, 3.9)
GitHub Check: Test Python (5, 3.9)
GitHub Check: Test Python (3, 3.12)
GitHub Check: Test Python (1, 3.9)
GitHub Check: Test Python (2, 3.9)
GitHub Check: Test Python (1, 3.12)
GitHub Check: Test C++ (true)
GitHub Check: Test C++ (false)
GitHub Check: Build wheels for cp311-macosx_x86_64
GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)

🔇 Additional comments (3)

deepmd/utils/argcheck.py (2)

2994-2998: LGTM! Clear documentation of systems parameter behavior.

The updated documentation accurately describes the new behavior where both string and list inputs undergo recursive directory search, with each list item processed identically to a standalone string input.

3075-3079: LGTM! Consistent documentation across training and validation data.

The validation data documentation mirrors the training data documentation, maintaining consistency and clarity.

deepmd/utils/data_system.py (1)

787-807: LGTM! Updated docstring accurately reflects new behavior.

The docstring clearly describes the new uniform treatment of both string and list inputs, where each path is recursively searched for system directories.

codecov · 2025-11-02T05:37:41Z

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.23%. Comparing base (7778e2e) to head (71921b9).
⚠️ Report is 1 commits behind head on devel.

Files with missing lines	Patch %	Lines
deepmd/utils/data_system.py	90.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##            devel    #5033      +/-   ##
==========================================
- Coverage   84.23%   84.23%   -0.01%     
==========================================
  Files         709      709              
  Lines       70075    70081       +6     
  Branches     3618     3619       +1     
==========================================
+ Hits        59027    59032       +5     
- Misses       9881     9883       +2     
+ Partials     1167     1166       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

njzjz

Please update the documentation:

deepmd-kit/doc/train/training-advanced.md

Lines 78 to 80 in 7778e2e

    
           - {ref}`systems <training/training_data/systems>` provide paths of the training data systems. DeePMD-kit allows you to provide multiple systems with different numbers of atoms. This key can be a `list` or a `str`. 
        
             - `list`: {ref}`systems <training/training_data/systems>` gives the training data systems. 
        
             - `str`: {ref}`systems <training/training_data/systems>` should be a valid path. DeePMD-kit will recursively search all data systems in this path.

…nput handling

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

deepmd/utils/data_system.py (1)

808-817: Use TypeError for invalid type validation.

Static analysis correctly suggests using TypeError instead of ValueError when validating input types, as this is the more semantically appropriate exception for type-related errors.

Apply this diff:

     else:
-        # Handle unsupported input types
         raise ValueError(
-            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
+            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
         )
+        raise TypeError(
+            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
+        )

Actually, let me fix that diff:

-    else:
-        # Handle unsupported input types
-        raise ValueError(
-            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
-        )
+    else:
+        raise TypeError(
+            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
+        )

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b0d858f and 71921b9.

📒 Files selected for processing (3)

deepmd/utils/argcheck.py (2 hunks)
deepmd/utils/data_system.py (2 hunks)
doc/train/training-advanced.md (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

deepmd/utils/data_system.py
deepmd/utils/argcheck.py

🧬 Code graph analysis (1)

deepmd/utils/data_system.py (1)

deepmd/common.py (2)

expand_sys_str (191-208)

rglob_sys_str (211-232)

🪛 LanguageTool

doc/train/training-advanced.md

[uncategorized] ~79-~79: Loose punctuation mark.
Context: ...ey can be a list or a str. - str: {ref}`systems <training/training_data/s...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~80-~80: Loose punctuation mark.
Context: ...or all system subdirectories. - list: {ref}`systems <training/training_data/s...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 Ruff (0.14.3)

deepmd/utils/data_system.py

815-817: Prefer TypeError exception for invalid type

(TRY004)

815-817: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)

GitHub Check: Build wheels for cp311-macosx_arm64
GitHub Check: Build wheels for cp311-macosx_x86_64
GitHub Check: Build wheels for cp310-manylinux_aarch64
GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
GitHub Check: Build C++ (clang, clang)
GitHub Check: Build C++ (cpu, cpu)
GitHub Check: Build C++ (cuda, cuda)
GitHub Check: Build C++ (cuda120, cuda)
GitHub Check: Test Python (3, 3.9)
GitHub Check: Test Python (6, 3.9)
GitHub Check: Test Python (2, 3.12)
GitHub Check: Test Python (1, 3.9)
GitHub Check: Test C++ (false)
GitHub Check: Test C++ (true)
GitHub Check: Analyze (python)
GitHub Check: Analyze (c-cpp)

🔇 Additional comments (5)

deepmd/utils/argcheck.py (2)

2996-2998: LGTM! Documentation accurately reflects the enhanced behavior.

The docstring clearly describes both input variants (string and list) and how each is processed. The documentation is consistent with the implementation changes in process_systems.

3078-3080: LGTM! Consistent documentation for validation data.

The docstring mirrors the training_data_args documentation and accurately describes the behavior for validation data systems.

doc/train/training-advanced.md (1)

79-80: LGTM! User documentation is clear and accurate.

The documentation clearly explains both input variants and emphasizes that list items are processed the same way as individual string inputs, which helps users understand the recursive search behavior.

deepmd/utils/data_system.py (2)

819-827: LGTM! Clear iteration logic with proper expansion.

The iteration over search_paths with appropriate expansion logic (either expand_sys_str or rglob_sys_str) correctly implements the enhanced functionality described in the PR objectives.

808-829: Address deduplication inconsistency: expand_sys_str paths are not deduplicated.

The review comment raises valid but partially addressed concerns:

Empty list handling: An empty result_systems list is properly validated downstream. When process_systems returns an empty list, DeepmdDataSystem.__init__ (line 104-105) raises ValueError("No systems provided") with a clear error message. This is handled as designed.

Deduplication: Inconsistency exists between code paths:

When patterns is not None: rglob_sys_str already deduplicates via list(set(matches)) at line 232 of deepmd/common.py

When patterns is None: expand_sys_str does not deduplicate; if multiple paths in search_paths yield overlapping directories, duplicates persist

For consistency and robustness, consider adding deduplication at the end of process_systems (line 829) to handle all scenarios uniformly, or ensure expand_sys_str deduplicates when called via process_systems.

Copilot AI review requested due to automatic review settings November 2, 2025 04:40

github-actions bot added the Python label Nov 2, 2025

Copilot AI reviewed Nov 2, 2025

View reviewed changes

deepmd/utils/data_system.py Outdated Show resolved Hide resolved

coderabbitai bot reviewed Nov 2, 2025

View reviewed changes

iProzd approved these changes Nov 3, 2025

View reviewed changes

iProzd requested a review from njzjz November 3, 2025 08:54

njzjz requested changes Nov 6, 2025

View reviewed changes

github-actions bot added the Docs label Nov 7, 2025

OutisLi added 6 commits November 7, 2025 15:09

feat:when input system is a list, it will recurrsivly find subsystems

205339f

doc:modify corresponding docstring

f236811

fix:missing else

583a435

refactor: streamline process_systems function and improve input handling

7cd6be1

update comments

ca6d147

doc: enhance documentation for training and validation data systems i…

71921b9

…nput handling

OutisLi force-pushed the pr/multi_system branch from 6d4ea7d to 71921b9 Compare November 7, 2025 07:09

OutisLi requested a review from njzjz November 7, 2025 07:10

coderabbitai bot reviewed Nov 7, 2025

View reviewed changes

njzjz approved these changes Nov 7, 2025

View reviewed changes

njzjz added this pull request to the merge queue Nov 7, 2025

Merged via the queue into deepmodeling:devel with commit 25fa707 Nov 7, 2025
60 checks passed

OutisLi deleted the pr/multi_system branch November 10, 2025 02:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Enhance process_systems to recursively search all paths in systems list #5033

feat: Enhance process_systems to recursively search all paths in systems list #5033

Uh oh!

OutisLi commented Nov 2, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

coderabbitai bot commented Nov 2, 2025 •

edited

Loading

Other AI code review bot(s) detected

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Uh oh!

codecov bot commented Nov 2, 2025 •

edited

Loading

Uh oh!

njzjz left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	- {ref}`systems <training/training_data/systems>` provide paths of the training data systems. DeePMD-kit allows you to provide multiple systems with different numbers of atoms. This key can be a `list` or a `str`.
	- `list`: {ref}`systems <training/training_data/systems>` gives the training data systems.
	- `str`: {ref}`systems <training/training_data/systems>` should be a valid path. DeePMD-kit will recursively search all data systems in this path.

feat: Enhance process_systems to recursively search all paths in systems list #5033

feat: Enhance process_systems to recursively search all paths in systems list #5033

Uh oh!

Conversation

OutisLi commented Nov 2, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Justification

Benefits

Summary by CodeRabbit

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

coderabbitai bot commented Nov 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Nov 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

njzjz left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

OutisLi commented Nov 2, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Nov 2, 2025 •

edited

Loading

codecov bot commented Nov 2, 2025 •

edited

Loading